Lesson 7. Crossref

Crossref is a nonprofit organization that manages a registry of Digital Object Identifiers (DOIs). Publishers collaborate with Crossref to assign a unique DOI to each journal article, book, conference paper, or dataset they publish. This DOI acts like a permanent web address, enabling seamless linking between references, citations, research outputs, funding information, and more.

The Crossref REST API offers free access to this metadata. This tutorial introduces two useful concepts: JSON, a simple data format that resembles Python dictionaries and is easy to read and use, and error handling with Python’s built-in logging module.

Data skills | concepts

  • APIs
  • logging
  • JSON data

Learning objectives

  1. Interpret documentation and apply concepts to write functional code.
  2. Extract and work with JSON data using Python’s built-in tools.
  3. Use Python’s logging module to capture and report errors that interrupt code execution.

This tutorial is designed to support multi-session workshops hosted by The Ohio State University Libraries Research Commons. It assumes you already have a basic understanding of Python, including how to iterate through lists and dictionaries to extract data using a for loop. To learn basic Python concepts, visit the Python - Mastering the Basics tutorial.

Crossref provides detailed documentation and a wide range of robust learning resources to help users effectively work with its REST API.

JSON

Crossref queries return data in JSON format, which is easy to read and looks similar to Python dictionaries. You can work with JSON data by looping through its key-value pairs to access the information you need.
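To see the resemblance, here is a minimal sketch of parsing a Crossref-style JSON record with Python’s built-in json module. The field names (message, title, reference-count) are real Crossref fields, but the values are invented for illustration:

```python
import json

# A trimmed-down, invented example of a Crossref works record
raw = '''
{
  "message": {
    "publisher": "Example Press",
    "title": ["A Sample Article"],
    "reference-count": 12
  }
}
'''

record = json.loads(raw)        # JSON text -> nested Python dicts and lists
message = record['message']
print(message['title'][0])      # A Sample Article

# Loop through key-value pairs just as you would with a dictionary
for key, value in message.items():
    print(key, '->', value)
```

Note that JSON arrays become Python lists, which is why title is accessed with [0]: Crossref stores titles as a list even when there is only one.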

Read through the Crossref REST API documentation. Then …

  1. Read data/dois.csv into a Pandas DataFrame
  2. Use the Crossref works API to gather the following fields for each DOI:
    • publisher
    • article_title
    • journal_title
    • journal_abbr
    • year
    • reference_count

import requests
import pandas as pd

def lookup(target_doi):
    """Return the Crossref works record for a single DOI as parsed JSON."""
    base_url = 'https://api.crossref.org/works/'
    url = base_url + target_doi
    response = requests.get(url)
    response.raise_for_status()   # raise an HTTPError for bad responses
    json_data = response.json()   # parse the JSON response body
    return json_data

file = pd.read_csv('data/dois.csv')  # adjust the path to wherever your data folder lives
dois = file.doi.tolist()
results = pd.DataFrame(columns=['doi', 'publisher', 'article_title', 'journal_title',
                                'journal_abbr', 'year', 'reference_count'])

for doi in dois:
    data = {}
    response = lookup(doi)
    entry = response['message']
    data['doi'] = doi
    data['publisher'] = entry['publisher']
    data['article_title'] = entry['title'][0]
    data['journal_title'] = entry['container-title'][0]
    abbr = entry.get('short-container-title') or ['']   # abbreviated title; absent for some records
    data['journal_abbr'] = abbr[0]
    data['year'] = entry['published']['date-parts'][0][0]
    data['reference_count'] = entry['reference-count']
    row = pd.DataFrame(data, index=[0])
    results = pd.concat([results, row], axis=0, ignore_index=True)  # append row, keeping input order

Logging

APIs sometimes return error codes that interrupt a program’s execution. The logging module lets you record these errors to a file and keep running instead of crashing. The resulting log can also help you identify issues in your code.
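Before applying this to the Crossref code, here is a minimal sketch of the pattern: catch an exception, record it with logging.error, and let the program continue. The function name and format string are illustrative choices, not requirements:

```python
import logging

# Send messages of level ERROR and above to the log, with a timestamp
logging.basicConfig(level=logging.ERROR,
                    format="%(asctime)s - %(levelname)s - %(message)s")

def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError as err:
        logging.error(f"Division failed: {err}")  # recorded, not raised
        return None

print(safe_divide(10, 2))   # 5.0
print(safe_divide(10, 0))   # None, and the error appears in the log
```

Returning None on failure gives the calling code a way to detect that something went wrong and skip that item, a pattern the exercise below uses as well.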

Tip - Copilot

Ask Copilot how to handle exceptions with the logging module. Copilot will return code you can modify for your project, along with additional tips.

Modify the code from Exercise 1 to add a function that logs and handles HTTP errors for bad responses.

import requests
import pandas as pd
import logging
import time

# Configure logging: write errors to a file with a timestamp on each entry
formatstring = "%(asctime)s - %(levelname)s - %(message)s"
datestring = "%m/%d/%Y %I:%M:%S %p"
logging.basicConfig(filename="cr_errors_find_dois.log", level=logging.ERROR,
                    format=formatstring, datefmt=datestring)

# Define a function that requests the URL and logs HTTP errors
def lookup(target_doi):
    try:
        base_url = 'https://api.crossref.org/works/'
        url = base_url + target_doi
        response = requests.get(url)
        response.raise_for_status()   # raise an HTTPError for bad responses
        json_data = response.json()   # parse the JSON response body
        return json_data
    except requests.exceptions.HTTPError as http_err:
        logging.error(f"HTTP Error = {http_err}")  # log the HTTP error
        time.sleep(10)                # pause before the next request
    except Exception as err:
        logging.error(f"Other error = {err}")      # log any other errors
        time.sleep(10)
    return None                       # signal failure to the caller
        
file = pd.read_csv('data/dois.csv')  # adjust the path to wherever your data folder lives
dois = file.doi.tolist()
results = pd.DataFrame(columns=['doi', 'publisher', 'article_title', 'journal_title',
                                'journal_abbr', 'year', 'reference_count'])

for doi in dois[0:2]:   # slice keeps the test run short; use dois for the full list
    data = {}
    response = lookup(doi)
    if response is None:             # lookup failed and logged the error; skip this DOI
        continue
    entry = response['message']
    data['doi'] = doi
    data['publisher'] = entry['publisher']
    data['article_title'] = entry['title'][0]
    data['journal_title'] = entry['container-title'][0]
    abbr = entry.get('short-container-title') or ['']   # abbreviated title; absent for some records
    data['journal_abbr'] = abbr[0]
    data['year'] = entry['published']['date-parts'][0][0]
    data['reference_count'] = entry['reference-count']
    row = pd.DataFrame(data, index=[0])
    results = pd.concat([results, row], axis=0, ignore_index=True)  # append row, keeping input order
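Once your script works, consider identifying yourself to Crossref. The REST API documentation describes a “polite pool”: requests that include a contact email address (for example, in the User-Agent header) are routed to a better-monitored pool of servers. The sketch below illustrates the idea; the function names and the User-Agent string are hypothetical choices, not Crossref requirements:

```python
import requests

def polite_headers(email):
    """Build headers that identify the script and a contact email to Crossref."""
    # Any descriptive identifier works; 'mailto:' marks the contact address
    return {'User-Agent': f'crossref-workshop-script/1.0 (mailto:{email})'}

def polite_lookup(target_doi, email):
    """Fetch a Crossref works record while opting into the polite pool."""
    url = 'https://api.crossref.org/works/' + target_doi
    response = requests.get(url, headers=polite_headers(email), timeout=30)
    response.raise_for_status()
    return response.json()
```

Swapping polite_lookup in for lookup (and passing your email address) leaves the rest of the script unchanged, since both return the same parsed JSON.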